FISH: fast and accurate diploid genotype imputation via segmental hidden Markov model

Authors

  • Lei Zhang
  • Yu-Fang Pei
  • Xiaoying Fu
  • Yong Lin
  • Yu-Ping Wang
  • Hong-Wen Deng
Abstract

MOTIVATION: Fast and accurate genotype imputation is necessary for facilitating gene-mapping studies, especially given the ever-increasing numbers of both common and rare variants generated by high-throughput sequencing experiments. However, most existing imputation approaches suffer from either inaccurate results or heavy computational demand.

RESULTS: In this article, we propose a novel method for imputing diploid genotypes that is both fast and accurate. Specifically, we extend a hidden Markov model that is widely used to describe haplotype structures, but we map hidden states onto single reference haplotypes rather than onto pairs of haplotypes. Consequently, the computational complexity is linear in the size of the reference haplotype panel. We further develop an algorithm, 'merge-and-recover' (MAR), to speed up the calculation. Working on a compact representation of segmental reference haplotypes, the MAR algorithm always calculates an exact form of the transition probabilities, regardless of how the segments are partitioned. Both simulation studies and real-data analyses demonstrated that the proposed method was comparable to most of the existing popular methods in terms of imputation accuracy, but was much more efficient computationally. The MAR algorithm can further speed up the calculation several-fold without loss of accuracy. The proposed method will be useful in large-scale imputation studies with a large number of reference subjects.

AVAILABILITY: The implemented multi-threading software FISH is freely available for academic use at https://sites.google.com/site/lzhanghomepage/FISH.
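To illustrate why mapping hidden states onto single reference haplotypes keeps the cost linear in the panel size, here is a minimal sketch of a standard haploid copying HMM (in the style of Li and Stephens) with a forward-backward pass. It is not the FISH diploid model and it does not implement the MAR algorithm, whose details are not given in this abstract. The function names, the uniform switch probability `switch`, the error rate `err`, and the use of -1 for an untyped site are all assumptions made for illustration.

```python
import numpy as np


def _emission(ref_col, allele, err):
    """Emission probabilities at one site; an untyped allele (-1) is uninformative."""
    if allele < 0:
        return np.ones(ref_col.shape[0])
    return np.where(ref_col == allele, 1.0 - err, err)


def impute_haploid(ref, obs, switch=0.01, err=0.001):
    """Posterior allele dosages for one target haplotype.

    ref: (N, L) array of 0/1 reference alleles (N haplotypes, L sites).
    obs: length-L array of 0/1 alleles, with -1 marking untyped sites.
    Returns (dosage, post): dosage has length L, post has shape (N, L).
    """
    ref = np.asarray(ref)
    obs = np.asarray(obs)
    n_hap, n_sites = ref.shape

    # Forward pass. The transition is "stay with prob. 1 - switch, otherwise
    # jump to a uniformly chosen haplotype", so each update costs O(N)
    # instead of the O(N^2) of a general transition matrix.
    fwd = np.empty((n_hap, n_sites))
    cur = np.full(n_hap, 1.0 / n_hap) * _emission(ref[:, 0], obs[0], err)
    fwd[:, 0] = cur / cur.sum()
    for l in range(1, n_sites):
        prev = fwd[:, l - 1]
        cur = (1.0 - switch) * prev + switch / n_hap * prev.sum()
        cur = cur * _emission(ref[:, l], obs[l], err)
        fwd[:, l] = cur / cur.sum()  # rescale to avoid underflow

    # Backward pass with the same O(N) shortcut.
    bwd = np.empty((n_hap, n_sites))
    bwd[:, -1] = 1.0 / n_hap
    for l in range(n_sites - 2, -1, -1):
        w = bwd[:, l + 1] * _emission(ref[:, l + 1], obs[l + 1], err)
        cur = (1.0 - switch) * w + switch / n_hap * w.sum()
        bwd[:, l] = cur / cur.sum()

    # Posterior over the copied haplotype at each site, then expected allele.
    post = fwd * bwd
    post /= post.sum(axis=0, keepdims=True)
    dosage = (post * ref).sum(axis=0)
    return dosage, post
```

For example, calling `impute_haploid(ref, obs)` with a three-haplotype panel and a target whose second site is marked -1 returns a dosage for that site weighted by how well each reference haplotype explains the typed sites. A diploid model built on pairs of haplotypes would have N^2 states per site, which is the cost the abstract says the single-haplotype formulation avoids.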


Similar resources

Reconstructing DNA Copy Number by Penalized Estimation and Imputation

Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani a...


Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

MOTIVATION Over the last few years, methods based on suffix arrays using the Burrows-Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on stati...


A new multipoint method for genome-wide association studies via imputation of genotypes: Supplementary Methods

where $Z_i^{(1)} = \{Z_{i1}^{(1)}, \ldots, Z_{iL}^{(1)}\}$ and $Z_i^{(2)} = \{Z_{i1}^{(2)}, \ldots, Z_{iL}^{(2)}\}$ are two sequences of hidden states at the $L$ sites and $Z_{il}^{(j)} \in \{1, \ldots, N\}$. At a given locus these hidden states can be thought of as the pair of haplotypes in the set $H$ that are being copied at that locus to form the genotype vector $G_i$. Here $\Pr(Z_i^{(1)}, Z_i^{(2)} \mid H)$ defines the prior probability on how the seq...


Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

MOTIVATION Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov model (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming...


The importance of genetic relationship and phenotypic records for the genomic accuracy of simulated imputed data using animal models in the presence of genotype × environment interactions

The objective of this study was to investigate how genetic relationships between the training and validation sets, together with different ratios of phenotypic records in the training set, affect the accuracy of genomic prediction via animal models containing genotype × environment interactions in simulated imputation data. For this purpose, four different scenarios using 15k density containing differen...



Journal:
  • Bioinformatics

Volume 30, Issue 13

Pages: -

Publication date: 2014